Will AI Systems Run Out of Publicly Available Data on the Internet?
2024-06-13
A research group says artificial intelligence (AI) companies could run out of publicly available data for their systems in less than eight years.

Training data includes writing and information publicly available on the internet. AI companies use the internet to "train" AI systems to create human-sounding writing. This "training" is what developers use to create large language models. Currently, many technology companies are developing large language models this way.

The nonprofit research group Epoch AI examines issues relating to AI. It has been following the development of large language models for a few years. In a recent paper, the group said technology companies will exhaust the supply of publicly available training data for AI language models between 2026 and 2032.

The team's latest paper has been reviewed by experts, or peer reviewed. It is to be presented at the International Conference on Machine Learning in Vienna, Austria, this summer. Epoch AI is linked to the research group Rethink Priorities based in San Francisco, California.

A 'gold rush'

Researcher Tamay Besiroglu is one of the paper's writers. He compared the current situation to a "gold rush" in which limited resources are depleted. He said the field of AI might face problems as the current speed of development uses up the current supply of human writing.

As a result, technology companies like OpenAI, the maker of ChatGPT, and Google are seeking to pay for high-quality data. Their goal is to ensure a flow of good material to train their systems. OpenAI has made deals with social media service Reddit and news provider News Corp. to use their material. The researchers consider this a short-term answer.

Over the long term, the group said, there will not be enough new blogs, news stories or social media writing to support the speed of AI development. That could lead companies to seek online data considered private, such as email and phone communications. They also might increasingly use AI-created data, such as chatbot content.

A 'bottleneck' in development?

Besiroglu described the issue as a "bottleneck" that can prevent companies from making improvements to their AI models, a process called "scaling up."

"...Scaling up models has been probably the most important way of expanding their capabilities and improving the quality of their output."

The Epoch AI group first made its predictions two years ago. That was weeks before the release of ChatGPT. At the time, the group said "high-quality language data" would be exhausted by 2026. Since then, AI researchers have developed new methods that make better use of data and that "overtrain" models on the same data many times. But there are limits to such methods.

While the amount of written information that is fed into AI systems has been growing, so has computing power, Epoch AI said. The parent company of Facebook, Meta Platforms, recently said the latest version of its Llama 3 model was trained on up to 15 trillion word pieces called tokens.

But whether a "bottleneck" in development is a concern remains the subject of debate.

Nicolas Papernot teaches computer engineering at the University of Toronto. He was not involved in the Epoch study. He said building more skilled AI systems can come from training them for specialized tasks. Papernot said he is concerned that training AI systems on AI-produced writing could lead to a situation known as "model collapse."

Permission and quality

Also, internet-based services such as Reddit and the information service Wikipedia are considering how they are being used by AI models. Wikipedia has placed few restrictions on how AI companies use its articles, which are written by volunteers.

But professional writers are worried about their protected materials. Last fall, 17 writers brought a legal action against OpenAI for what they called "systematic theft on a mass scale." They said ChatGPT was using their materials, which are protected by copyright laws, without permission.

AI developers are concerned about the quality of what they train their systems on. Epoch AI's study noted that paying millions of humans to write for AI models "is unlikely to be an economical way" to improve performance.

The chief of OpenAI, Sam Altman, told a group at a United Nations event last month that his company has experimented with "generating lots of synthetic data" for training. He said both humans and machines produce high- and low-quality data.

Altman expressed concerns, however, about depending too heavily on synthetic data over other technical methods to improve AI models.

"There'd be something very strange if the best way to train a model was to just generate...synthetic data and feed that back in," Altman said. "Somehow that seems inefficient."

I'm Caty Weaver.

And I'm Mario Ritter, Jr.

Matt O'Brien reported this story for the Associated Press. Mario Ritter, Jr. adapted it for VOA Learning English.

_______________________________________________________

Words in This Story

exhaust -v. to completely use up a resource
depleted -adj. when a resource is almost used up
trajectory -n. the direction that something is taking or is predicted to take
synthetic -adj. created by a process that is not natural
scale -n. the level or size of a thing
generate -v. to create something through a process